AITopics | object 2

Collaborating Authors

object 2

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

ad02c6f3824f871395112ae71a28eff7-Supplemental-Conference.pdf

Neural Information Processing SystemsAug-17-2025, 15:47:14 GMT

artificial intelligence, object 2, video, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (1.00)

Add feedback

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

Jiang, Dongzhi, Zhang, Renrui, Guo, Ziyu, Li, Yanwei, Qi, Yu, Chen, Xinyan, Wang, Liuhui, Jin, Jianhan, Guo, Claire, Yan, Shen, Zhang, Bo, Fu, Chaoyou, Gao, Peng, Li, Hongsheng

arXiv.org Artificial IntelligenceFeb-13-2025

Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation. In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess the reasoning quality, robustness, and efficiency at a fine-grained level. Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights: 1) Models with reflection mechanism demonstrate a superior CoT quality, with Kimi k1.5 outperforming GPT-4o and demonstrating the highest quality results; 2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting a potentially harmful overthinking behavior; and 3) Although the CoT quality is high, LMMs with reflection exhibit significant inefficiency in both normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs. Project Page: https://mmecot.github.io/

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2502.09621

Country:

Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)

Genre:

Workflow (0.68)
Research Report (0.50)

Industry:

Health & Medicine > Consumer Health (1.00)
Education > Health & Safety > School Nutrition (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)

Add feedback

Making Large Multimodal Models Understand Arbitrary Visual Prompts

Cai, Mu, Liu, Haotian, Mustikovela, Siva Karthik, Meyer, Gregory P., Chai, Yuning, Park, Dennis, Lee, Yong Jae

arXiv.org Artificial IntelligenceDec-1-2023

While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings, yet achieves state-of-the-art performance on region-understanding tasks like Visual7W, PointQA, and Visual Commonsense Reasoning benchmark. Furthermore, we present ViP-Bench, a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions, enabling future research in this domain. Code, data, and model are publicly available.

vip-bench, vip-llava, visual prompt, (13 more...)

arXiv.org Artificial Intelligence

2312.00784

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
North America > Greenland (0.05)
Europe > Iceland (0.05)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Add feedback

PACS: A Dataset for Physical Audiovisual CommonSense Reasoning

Yu, Samuel, Wu, Peter, Liang, Paul Pu, Salakhutdinov, Ruslan, Morency, Louis-Philippe

arXiv.org Artificial IntelligenceAug-1-2022

In order for AI to be safely deployed in real-world scenarios such as hospitals, schools, and the workplace, it must be able to robustly reason about the physical world. Fundamental to this reasoning is physical common sense: understanding the physical properties and affordances of available objects, how they can be manipulated, and how they interact with other objects. Physical commonsense reasoning is fundamentally a multi-sensory task, since physical properties are manifested through multiple modalities - two of them being vision and acoustics. Our paper takes a step towards real-world physical commonsense reasoning by contributing PACS: the first audiovisual benchmark annotated for physical commonsense attributes. PACS contains 13,400 question-answer pairs, involving 1,377 unique physical commonsense questions and 1,526 videos. Our dataset provides new opportunities to advance the research field of physical reasoning by bringing audio as a core component of this multimodal problem. Using PACS, we evaluate multiple state-of-the-art models on our new challenging task. While some models show promising results (70% accuracy), they all fall short of human performance (95% accuracy). We conclude the paper by demonstrating the importance of multimodal reasoning and providing possible avenues for future research.

artificial intelligence, commonsense reasoning, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2203.1113

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Government (0.46)
Education (0.46)
Health & Medicine (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Commonsense Reasoning (0.92)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)

Add feedback

Unsupervised Learning of Compositional Energy Concepts

Du, Yilun, Li, Shuang, Sharma, Yash, Tenenbaum, Joshua B., Mordatch, Igor

arXiv.org Artificial IntelligenceNov-4-2021

Humans are able to rapidly understand scenes by utilizing concepts extracted from prior experience. Such concepts are diverse, and include global scene descriptors, such as the weather or lighting, as well as local scene descriptors, such as the color or size of a particular object. So far, unsupervised discovery of concepts has focused on either modeling the global scene-level or the local object-level factors of variation, but not both. In this work, we propose COMET, which discovers and represents concepts as separate energy functions, enabling us to represent both global concepts as well as objects under a unified framework. COMET discovers energy functions through recomposing the input image, which we find captures independent factors without additional supervision. Sample generation in COMET is formulated as an optimization process on underlying energy functions, enabling us to generate images with permuted and composed concepts. Finally, discovered visual concepts in COMET generalize well, enabling us to compose concepts between separate modalities of images as well as with other concepts discovered by a separate instance of COMET trained on a different dataset. Code and data available at https://energy-based-model.github.io/comet/.

dataset, energy function, variation, (13 more...)

arXiv.org Artificial Intelligence

2111.03042

Country: